Lab Topics

Regression Diagnostics:
  • Outliers, leverage, and influence
  • Added variable and marginal model plots

Goals:

After this lab you will be able to:
  • assess and diagnose the extent to which outlying observations are driving your results
  • assess the impact of a given variable within a multiple regression model

This lab uses materials by:
  • Angela Dixon
  • Andrew Bray
  • Andrew Bray and Mine Cetinkaya-Rundel
  • Brian Caffo, Jeff Leek, Roger Peng

The linear model:

Model diagnostics

To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

Let’s take a look at the usage of “rail trails,” which are trail systems built on old rail grades; we will explore the relationship between temperature and ridership. Load the RailTrail data set from the mosaicData package (library(mosaicData) makes it available):

head(RailTrail)
##   hightemp lowtemp avgtemp spring summer fall cloudcover precip volume
## 1       83      50    66.5      0      1    0        7.6   0.00    501
## 2       73      49    61.0      0      1    0        6.3   0.29    419
## 3       74      52    63.0      1      0    0        7.5   0.32    397
## 4       95      61    78.0      0      1    0        2.6   0.00    385
## 5       44      52    48.0      1      0    0       10.0   0.14    200
## 6       69      54    61.5      1      0    0        6.6   0.02    375
##   weekday
## 1       1
## 2       1
## 3       1
## 4       0
## 5       1
## 6       1
  1. Using a scatterplot, examine the bivariate relationship between ridership (volume) and high temperature that day (hightemp):

  1. What type of relationship do you observe? Does a linear model at least appear worth trying? What type of relationship do you expect?

  2. Fit a bivariate regression of volume on hightemp

mod.rider <- lm(volume~hightemp,data=RailTrail)
  1. Describe the estimated relationship between volume of ridership and the high temperature hightemp.

Test whether the conditions for regression appear reasonable:

Condition 1: Linearity

Linearity: You already checked if the relationship between ridership and temperature is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. temperature.

plot(mod.rider$residuals ~ RailTrail$hightemp)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0
  1. Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between ridership and temperature?

Condition 2: Nearly Normal Residuals

Nearly normal residuals: To check this condition, we can look at a histogram

hist(mod.rider$residuals)

or a normal probability plot of the residuals.

qqnorm(mod.rider$residuals)
qqline(mod.rider$residuals)  # adds diagonal line to the normal prob plot
  1. How do these look to you?

Condition 3: Constant variability

Constant variability: There are formal tests for heteroskedasticity, but plots generally work as a rough heuristic, at least during preliminary fitting. Constant variability means that the variability of points around the regression line remains consistent for all values of X. Thus, we again use the plot of residuals ~ independent variable to check this:

plot(mod.rider$residuals ~ RailTrail$hightemp)
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0
  1. Based on the plot, does the constant variability condition appear to be met?

Outliers, leverage, influence

Outliers are points that don’t fit the trend in the rest of the data.

High leverage points have the potential to have an unusually large influence on the fitted model.

Influential points are high leverage points that cause a very different line to be fit than would be fit with that point removed.

Influence measures

Recall that there are numerous ways of assessing the influence that a given observation has on your model…

  • Run ?influence.measures to see the full suite of influence measures in stats. The measures include:
  • rstandard - standardized residuals: residuals divided by their standard deviations
  • rstudent - studentized residuals: like standardized residuals, but the \(i^{th}\) data point is deleted when calculating the standard deviation, so that the residual follows a t distribution
  • hatvalues - measures of leverage
  • dffits - change in the predicted response when the \(i^{th}\) point is deleted in fitting the model.
  • dfbetas - change in individual coefficients when the \(i^{th}\) point is deleted in fitting the model.
  • cooks.distance - overall change in the coefficients when the \(i^{th}\) point is deleted.
  • resid - returns the ordinary residuals
  • resid(fit) / (1 - hatvalues(fit)), where fit is the linear model fit, returns the PRESS residuals, i.e. the leave-one-out cross-validation residuals: the difference between the response and the predicted response at data point \(i\), where point \(i\) was not included in the model fitting.
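As a self-contained check of that last identity (on simulated data, not the lab’s data), the shortcut formula matches an explicit leave-one-out loop:

```r
# Sketch: verify the PRESS-residual shortcut against explicit leave-one-out fits
set.seed(1)
x <- rnorm(30)
y <- 2 + 3 * x + rnorm(30)
fit <- lm(y ~ x)

# Shortcut: PRESS residuals from ordinary residuals and leverages
press <- resid(fit) / (1 - hatvalues(fit))

# Explicit leave-one-out computation for comparison
press_loo <- sapply(seq_along(y), function(i) {
  fit_i <- lm(y ~ x, subset = -i)               # refit without point i
  y[i] - predict(fit_i, newdata = data.frame(x = x[i]))
})

all.equal(unname(press), unname(press_loo))  # TRUE
```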

Find Outliers

Standardized Residuals

One of the more common tools is standardized residuals. Standardized residuals are essentially “z-score residuals”, so if you are using a \(95\%\) confidence level, you can take a quick look to see if any standardized residuals are greater than 2 (or, more precisely, 1.96) or less than -2 (-1.96):

plot(rstandard(mod.rider) ~ RailTrail$hightemp)
abline(h = 2, lty = 3, col = 'red')   # adds a horizontal dashed line at y = 2
abline(h = -2, lty = 3, col = 'red')  # adds a horizontal dashed line at y = -2
  1. How do your standardized residuals look? How many points would you expect to be outside of the dashed lines simply due to random chance (hint: what does the \(95\%\) interval really mean)?

Studentized Residuals

Studentized residuals are just like standardized residuals, but we leave out a given point when computing the standard deviation. That way, the very point you are trying to analyze does not factor into its own standardization. In most cases you won’t notice much difference between the two.

plot(rstudent(mod.rider) ~ RailTrail$hightemp)
abline(h = 2, lty = 3, col = 'red')   # adds a horizontal dashed line at y = 2
abline(h = -2, lty = 3, col = 'red')  # adds a horizontal dashed line at y = -2
  1. Comment on any observed differences between the studentized and standardized residuals.
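As an aside, the exact relationship between the two can be checked on simulated data (a sketch, not part of the original lab): the studentized residual is a monotone transformation of the standardized one, so they rank points identically and differ most for large residuals.

```r
# Sketch: the known identity linking standardized and studentized residuals
set.seed(2)
x <- rnorm(40)
y <- 1 + 2 * x + rnorm(40)
fit <- lm(y ~ x)

r     <- rstandard(fit)  # standardized (internally studentized)
tstud <- rstudent(fit)   # studentized (point i left out of the sd)

# Identity: t_i = r_i * sqrt((df - 1) / (df - r_i^2)), df = residual df
df <- df.residual(fit)
all.equal(tstud, r * sqrt((df - 1) / (df - r^2)))  # TRUE
```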

Assess Leverage

We can assess leverage by plotting the hat values…

hat <- hatvalues(mod.rider)
plot(hat)
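The raw hat values are easier to read against a benchmark. A common rule of thumb (an addition here, not stated in the lab) flags points whose leverage exceeds 2 or 3 times the average leverage \(p/n\), where \(p\) is the number of coefficients. A sketch on simulated data:

```r
# Sketch: flagging high-leverage points against the 2 * p/n rule of thumb
set.seed(3)
x <- c(rnorm(29), 10)      # one point far out in x: high leverage by design
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

h <- hatvalues(fit)
p <- length(coef(fit))     # number of coefficients (here 2)
n <- length(h)

mean(h)                    # leverages always average to p/n
which(h > 2 * p / n)       # the extreme-x point gets flagged
```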

Measure influence

dfbetas and Cook’s Distance are both common ways to measure influence:

dfbetas are the difference in \(\beta_k\) when observation \(i\) is left out of the model. We care about how each point influences \(\beta_1\) in this case, not \(\beta_0\), so we’ll only plot the second column of results.

modbetas = dfbetas(mod.rider)
plot(modbetas[,2])

Cook’s Distance is an alternative measure: a summary metric that captures the total change in all model parameters due to a given observation. For this reason, there is no absolute standard for what counts as a “large” or “small” Cook’s distance. The most general criterion is that a point is highly influential when \(D_i > 1\). However, the number of observations directly affects how much influence any one point can have, independent of that observation’s values, so \(D_i > 4/n\) is also a common criterion.

modcooks = cooks.distance(mod.rider)
plot(modcooks)
abline(h=4/nrow(RailTrail),col='red')
  1. Comment on the dfbetas and Cook’s Distance results. What do you think might be driving some of these influential points?

Combining Leverage, Outliers, and Influence

Bubble Plot

It’s tough to know how useful a bubble plot is, but it’s fun to make!

plot(hatvalues(mod.rider), rstudent(mod.rider), type='n')
cook <- sqrt(cooks.distance(mod.rider))
points(hatvalues(mod.rider), rstudent(mod.rider), cex=10*cook/max(cook))
abline(h=c(-2,0,2), lty=2)
abline(v = c(2, 3) * length(coef(mod.rider)) / nrow(RailTrail), lty = 2)  # 2x and 3x the average leverage p/n

What might we do?

Robust regression downweights influential data:

library(MASS)
mod.rider.robust = rlm(volume~hightemp,data=RailTrail)
summary(mod.rider.robust)
## 
## Call: rlm(formula = volume ~ hightemp, data = RailTrail)
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -253.802  -55.176    8.995   58.812  314.594 
## 
## Coefficients:
##             Value    Std. Error t value 
## (Intercept) -25.3436  56.8986    -0.4454
## hightemp      5.8019   0.8124     7.1421
## 
## Residual standard error: 86.29 on 88 degrees of freedom

You could also delete problem points, but I would strongly recommend avoiding this if at all possible, unless you know with great confidence that a data point is an error and not simply an outlying observation.

library(MASS)
mod.rider.delete = lm(volume~hightemp,data=RailTrail[cooks.distance(mod.rider)<=4/nrow(RailTrail),])
summary(mod.rider.delete)
## 
## Call:
## lm(formula = volume ~ hightemp, data = RailTrail[cooks.distance(mod.rider) <= 
##     4/nrow(RailTrail), ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -262.83  -57.10    3.22   54.76  268.79 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -69.656     57.118  -1.220    0.226    
## hightemp       6.513      0.828   7.866 1.12e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 94.64 on 84 degrees of freedom
## Multiple R-squared:  0.4242, Adjusted R-squared:  0.4173 
## F-statistic: 61.87 on 1 and 84 DF,  p-value: 1.124e-11
  1. Compare your three models (basic, robust, and point-deletion): how and why do they differ? Also, how many points did you delete in the deletion model?

Multiple regression diagnostics

Residual plots are somewhat problematic for multiple regression because we have many different input variables. Instead, we will use “added variable plots”.

Residual plots, bivariate regression vs multiple regression

In simple linear regression we use residual plots to assess:

  1. Does the mean function appear linear?
  2. Is it reasonable to assume that the errors have constant variance?

Residual plots in multiple regression

We fit the model:

\[ y \sim x_1 + x_2 \]

If this were a bivariate model, we could conclude that the mean function looks fairly linear but that the errors appear to have increasing variance. However, these are fake data generated from a model with constant variance!
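The slides’ fake data are not reproduced here, but here is a sketch of the same phenomenon (an illustrative assumption, not the original example): homoskedastic errors plus an omitted interaction, so the additive fit’s residual plot shows structure even though the error variance is constant.

```r
# Sketch: constant-variance errors, but a misspecified (additive) mean
# function creates apparent structure in the residuals-vs-fitted plot
set.seed(4)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- x1 + x2 + 4 * x1 * x2 + rnorm(n, sd = 0.1)  # errors ARE homoskedastic

fit <- lm(y ~ x1 + x2)          # additive fit omits the interaction
plot(fitted(fit), resid(fit))   # structure here reflects the mean function,
                                # not non-constant error variance
```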

In MLR, in general, you cannot assume that the structure you see in the residuals vs. fitted plot reflects the part of the model that was misspecified.

  • Non-constant variance in the residuals doesn’t necessarily suggest non-constant variance in the errors.
  • Non-linear structures don’t necessarily suggest a non-linear mean function.

The only conclusion you can draw is that something is misspecified.

Diagnostic Plots for Multiple Regression

So now what?

  • Although several types of invalid models can create non-constant variance in the residuals, a valid model will always produce a structureless residual plot.

  • If you can be sure you have a good mean function, then the residual plot is more informative.

  1. Added Variable Plots: Used to assess a variable’s impact net of other model variables
  2. \(Y\) versus \(\hat{Y}\): used to assess whether the mean function is being modeled well.
  3. Marginal model plots: used to assess whether the mean function between each predictor and the response is being modeled well.

Added variable plots

The objective of constructing an added variable plot is to assess how much each variable adds to your model.

Consider some data concerning restaurants in NYC, where we’d like to build the model:

\[ Price \sim Food + Decor + Service + East \]

We can assess the isolated effect of each predictor on the response with a series of simple scatterplots…

This might be more efficient…

pairs(Price ~ Food + Decor + Service + East, data = nyc)

But this does not provide a way to look at a variable net of other variables. Instead, an added variable plot tells you how much a given predictor \(x_i\) can explain the response after the other predictors have been taken into account. An “av-plot” plots the residuals from regressing the response on all the other predictors (vertical axis) against the residuals from regressing \(x_i\) on those same predictors (horizontal axis).

Making an avplot by hand

First, get the residuals from the model

\[ Price \sim Decor + Service + East \]

resY <- lm(Price ~ Decor + Service + East, data = nyc)$res

Second, get the residuals from the model

\[ Food \sim Decor + Service + East \]

resX <- lm(Food ~ Decor + Service + East, data = nyc)$res

Then plot them against each other…

plot(resY ~ resX)

Making an avplot in 2 seconds…

The car package has an avPlots() function that does this for you…

library(car)
m1 <- lm(Price ~ Food + Decor + Service + East, data = nyc)
avPlot(m1,variable = "Food")

avPlots(m1)

Notice that if we fit a line through the AVP, the slope should look familiar…

AVPm1 <- lm(resY ~ resX)
AVPm1$coef
## (Intercept)        resX 
## 5.07403e-17 1.53812e+00
m1$coef
##   (Intercept)          Food         Decor       Service          East 
## -24.023799670   1.538119941   1.910087113  -0.002727483   2.068050156
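This equality is no coincidence: it is the Frisch-Waugh-Lovell theorem, and it holds for any data set. Since the nyc data require a download, here is a self-contained check on simulated data:

```r
# Sketch: the added-variable-plot slope equals the multiple regression
# coefficient (Frisch-Waugh-Lovell), checked on simulated data
set.seed(5)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n) + 0.5 * x1          # correlated predictors
y  <- 1 + 2 * x1 - x2 + rnorm(n)

full <- lm(y ~ x1 + x2)
resY <- resid(lm(y ~ x2))          # response, net of x2
resX <- resid(lm(x1 ~ x2))         # x1, net of x2

c(avp_slope = unname(coef(lm(resY ~ resX))[2]),
  mlr_coef  = unname(coef(full)["x1"]))   # the two agree
```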

How to use AVPs

  1. AVPs can be used to assess whether it makes sense to include an additional variable in the model (similar to looking at the p-value of the predictor).

  2. They’re a bit more informative, though, since they would also indicate if the relationship between that predictor and the response is linear in the context of the other variables.

Multiple Regression Practice

Let’s look at home prices in LA…

In the data set LA, a scatterplot suggests two influential points, but are they influential in a multiple regression model?

  1. Fit the model \(\hat{price} \sim sqft + bed + city\). By the rules of thumb, are those two points high leverage? Outliers? (You can extract the hat values using influence(m1)$hat.)

High leverage?

LA <- read.csv("http://andrewpbray.github.io/data/LA.csv")
m1 <- lm(price ~ sqft + bed + city, data = LA)
levs <- influence(m1)$hat

High leverage?

tail(sort(levs))
##       1255       1289       1277       1293       1294       1292 
## 0.04202814 0.04301286 0.04826781 0.07888082 0.10423838 0.15779785

High residual?

e_hat <- m1$res
s <- sqrt(1/(nrow(LA) - length(m1$coef)) * sum(e_hat^2))
r <- e_hat/(s * sqrt(1 - levs))
head(r)
##          1          2          3          4          5          6 
## -0.1251458 -0.1449723 -0.1155992 -0.5684922 -1.0500985  0.2047673
head(rstandard(m1))
##          1          2          3          4          5          6 
## -0.1251458 -0.1449723 -0.1155992 -0.5684922 -1.0500985  0.2047673

High residual?

hist(r)

tail(sort(r))
##      1193      1287      1288      1096      1294      1292 
##  3.579264  3.664149  3.664149  3.744932 14.659810 27.695804

Influence

cdist <- (r^2 / length(m1$coef)) * (levs/(1 - levs))
tail(sort(cdist))
##        1254        1260        1291        1289        1294        1292 
##  0.05408505  0.05847561  0.08015348  0.09034382  4.16812407 23.95308503
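The hand-rolled cdist uses the standard decomposition \(D_i = \frac{r_i^2}{p}\cdot\frac{h_i}{1-h_i}\), which should agree with cooks.distance() on any fit; a self-contained check on simulated data:

```r
# Sketch: hand-computed Cook's distance matches cooks.distance()
set.seed(6)
x <- rnorm(50)
y <- 1 + x + rnorm(50)
fit <- lm(y ~ x)

h <- hatvalues(fit)      # leverages
r <- rstandard(fit)      # standardized residuals
p <- length(coef(fit))   # number of coefficients

d_hand <- (r^2 / p) * (h / (1 - h))
all.equal(d_hand, cooks.distance(fit))  # TRUE
```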

Influence

plot(m1, 5)

  1. Calculate the Cook’s distance of those two observations.

  2. Generate the Cook’s distance plot.

  3. Look at the avplots:

AV Plots

library(car)
avPlots(m1)

  1. Now fit the more appropriate model, with \(logprice\) and \(logsqft\), and construct added variable plots. What do you learn about the relative usefulness of \(logsqft\) and \(bed\) as predictors?
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:MASS':
## 
##     select
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
LA <- mutate(LA, logprice = log(price), logsqft = log(sqft))
m2 <- lm(logprice ~ logsqft + bed + city, data = LA)
summary(m2)
## 
## Call:
## lm(formula = logprice ~ logsqft + bed + city, data = LA)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.26020 -0.24897 -0.01613  0.21804  1.37277 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.13068    0.21201  24.200   <2e-16 ***
## logsqft           1.20729    0.03036  39.769   <2e-16 ***
## bed              -0.03010    0.01284  -2.345   0.0191 *  
## cityLong Beach   -0.88280    0.03467 -25.464   <2e-16 ***
## citySanta Monica -0.09416    0.04022  -2.341   0.0194 *  
## cityWestwood     -0.46244    0.04876  -9.484   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3704 on 1588 degrees of freedom
## Multiple R-squared:  0.8742, Adjusted R-squared:  0.8738 
## F-statistic:  2206 on 5 and 1588 DF,  p-value: < 2.2e-16
avPlots(m2)

Overall, this model should look quite a bit better.

par(mfrow=c(2,2))
plot(m2)

Marginal Model Plots

Example: Defective widgets

defects <- read.table("http://www.stat.tamu.edu/~sheather/book/docs/datasets/defects.txt",
                      header = TRUE)
head(defects)
##   Case Temperature Density  Rate Defective
## 1    1        0.97   32.08 177.7       0.2
## 2    2        2.85   21.14 254.1      47.9
## 3    3        2.95   20.65 272.6      50.9
## 4    4        2.84   22.53 273.4      49.7
## 5    5        1.84   27.43 210.8      11.0
## 6    6        2.05   25.42 236.1      15.6

Let’s look at the pairwise comparisons…

pairs(Defective ~ Temperature + Density + Rate, data = defects)

Here’s our basic model…

\[ \widehat{Defective} \sim Temperature + Density + Rate \]

m1 <- lm(Defective ~ Temperature + Density + Rate, data = defects)
summary(m1)$coef
##               Estimate Std. Error    t value   Pr(>|t|)
## (Intercept) 10.3243900 65.9264798  0.1566046 0.87676619
## Temperature 16.0779089  8.2941050  1.9384742 0.06349394
## Density     -1.8272927  1.4970684 -1.2205807 0.23319949
## Rate         0.1167322  0.1306268  0.8936312 0.37971687

View a summary of model diagnostics:

par(mfrow = c(2, 2))
plot(m1)

View residuals plotted against each covariate…

\(Y\) versus \(\hat{Y}\)

Used to assess whether your model is doing a good job of modeling the response. If it is, you’ll see points along the identity line. If it’s not, there will be non-linear structure; try to correct it by transforming the response, and assess it on a predictor-by-predictor basis using marginal model plots.
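A minimal sketch of this plot on simulated data (a well-specified model, so the points should hug the identity line):

```r
# Sketch: plot the response against the fitted values; a good model
# puts the points along the identity line y = yhat
set.seed(7)
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

plot(fitted(fit), y, xlab = expression(hat(Y)), ylab = "Y")
abline(0, 1)   # the identity line
```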

The standardized residual plots and the plot of \(y\) on \(\hat{y}\) suggest that something is amiss, but what? We need to be sure that the structure in the data is being mirrored well by the structure in our model. This comparison is made for each predictor using the marginal model plot.

Marginal model plots are used to assess the marginal relationship between each predictor and the response.
  • They compare the fit from the model with a nonparametric fit to the scatterplot.
  • If your model is well-specified, these two lines will be close to coincident.

You can build them by hand using loess() or use mmp() in the car package.

Now, load the car package (if you haven’t already) and produce the marginal model plots…

library(car)
mmps(m1)

Or you can use loess to make these by hand:
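Here is one way that by-hand construction could look on simulated data (a sketch, under the assumption of a single numeric predictor of interest): overlay a loess smooth of the response against the predictor with a loess smooth of the model’s fitted values against that same predictor. If the model captures the marginal relationship, the two smooths nearly coincide.

```r
# Sketch: a marginal model plot built by hand with loess()
set.seed(8)
n  <- 150
x1 <- runif(n, 0, 5)
x2 <- runif(n, 0, 5)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

plot(x1, y)
lo_data  <- loess(y ~ x1)            # nonparametric fit to the data
lo_model <- loess(fitted(fit) ~ x1)  # nonparametric fit to the model's fits
ord <- order(x1)
lines(x1[ord], predict(lo_data)[ord],  col = "blue")
lines(x1[ord], predict(lo_model)[ord], col = "red", lty = 2)
# a well-specified model: the two smooths should nearly coincide
```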

An alternative model

\[ \widehat{\sqrt{Defective}} \sim Temperature + Density + Rate \]

defects <- transform(defects, sqrtDefective = sqrt(Defective))
m2 <- lm(sqrtDefective ~ Temperature + Density + Rate, data = defects)
summary(m2)$coef
##                Estimate Std. Error   t value   Pr(>|t|)
## (Intercept)  5.59297089 5.26400977  1.062492 0.29778175
## Temperature  1.56516337 0.66225665  2.363379 0.02586654
## Density     -0.29166383 0.11953593 -2.439968 0.02181531
## Rate         0.01289857 0.01043012  1.236666 0.22726602

Marginal model plots for second model

mmps(m2)

Comparing m1 and m2

How do these look?

Recap: MMP vs AVP

  • Marginal model plots: are useful in checking to see that you’re doing a good job of modeling the marginal relationship between a given predictor and the response.
  • Added variable plots: assess how much variation in the response can be explained by a given predictor after the other predictors have already been taken into account (links to p-values).

Goal check?

Questions?